Overview

For this machine learning lab, I am using the species Canis latrans to predict presence from observations and environmental data. There are five parts to this lab:

Canis Latrans in grassland habitat. Source Gerald and Buff Corsi

Lab 1a: Explore

Canis latrans observations from GBIF

obs_csv <- file.path(dir_data, "obs.csv")
obs_geo <- file.path(dir_data, "obs.geojson")
redo    <- FALSE

if (!file.exists(obs_geo) | redo){
  # get species occurrence data from GBIF with coordinates
  (res <- spocc::occ(
    query = 'Canis latrans', 
    from = 'gbif', 
    has_coords = T,
    limit = 10000))
  
  # extract data frame from result
  df <- res$gbif$data[[1]] 
  readr::write_csv(df, obs_csv)
  
  # convert to points of observation from lon/lat columns in data frame
  obs <- df %>% 
    sf::st_as_sf(
      coords = c("longitude", "latitude"),
      crs = st_crs(4326)) %>% 
    select(prov, key) # save space (joinable from obs_csv)
  sf::write_sf(obs, obs_geo, delete_dsn=T)
}
obs <- sf::read_sf(obs_geo)
nrow(obs) # number of rows
## [1] 10000

Map of distribution of points

# show points on map
mapview::mapview(obs, map.types = "Esri.WorldImagery")

Get environmental data

dir_env <- file.path(dir_data, "env")

# set a default data directory
options(sdmpredictors_datadir = dir_env)

# choosing terrestrial
env_datasets <- sdmpredictors::list_datasets(terrestrial = TRUE, marine = FALSE)

# show table of datasets
env_datasets %>% 
  select(dataset_code, description, citation) %>% 
  DT::datatable()
# choose datasets for a vector
env_datasets_vec <- c("WorldClim", "ENVIREM")

# get layers
env_layers <- sdmpredictors::list_layers(env_datasets_vec)
DT::datatable(env_layers)
# choose layers after some inspection and perhaps consulting literature
env_layers_vec <- c("WC_alt", "WC_bio1", "WC_bio12", "ER_tri", "ER_topoWet")

# get layers
env_stack <- load_layers(env_layers_vec)

# interactive plot layers, hiding all but first (select others)
# mapview(env_stack, hide = T) # makes the html too big for Github
plot(env_stack, nc=2)

Questions

How many observations total are in GBIF for your species?

Answer: There are 37,929 total observations of Canis latrans in the GBIF database.

Did you have to perform any data cleaning steps?

Answer: I did not have to perform any data cleaning steps. At first visually, I do not see any odd observations. All observations were within the North American continent, which is the appropriate range for Canis latrans. However, when I zoomed into water regions such as the Great Lakes or the Gulf of St. Lawrence by Newfoundland - I saw some observations were in the water near the shore. At first I thought this was odd, but further research showed that Canis latrans are excellent swimmers and often swim about 0.5 miles off shore, and up to 7 or 8 miles in total. (Fun fact: Canis latrans have webbed feet!) All water observations appeared to be within 0.5 miles off shore and so I did not consider them odd and left them as is.

What environmental layers did you choose as predictors? Can you find any support for these in the literature?

Answer: I chose WC_alt, WC_bio1, WC_bio12, ER_tri, and ER_topoWet from the WorldClim and ENVIREM data sets. I found the WorldClim and ENVIREM data sets using sdmpredictors list_datasets based on the criteria of terrestrial and excluded data sets related to marine environments.

Canis latrans can be found in many different types of habitats such as forests, prairie, mountains, and even agriculture or urban areas. I chose predictors like altitude, annual mean temperature, annual precipitation, terrain roughness index, and topographic wetness to accommodate for the wide variety of habitats that Canis latrans can be found in (1, 2, 3).

Convex hull based on Canis latrans observation points

obs_hull_geo  <- file.path(dir_data, "obs_hull.geojson")
env_stack_grd <- file.path(dir_data, "env_stack.grd")

if (!file.exists(obs_hull_geo) | redo){
  # make convex hull around points of observation
  obs_hull <- sf::st_convex_hull(st_union(obs))
  
  # save obs hull
  write_sf(obs_hull, obs_hull_geo)
}
obs_hull <- read_sf(obs_hull_geo)

# show points on map
mapview(
  list(obs, obs_hull))

Applied convex hull to chosen environmental layers.

if (!file.exists(env_stack_grd) | redo){
  obs_hull_sp <- sf::as_Spatial(obs_hull)
  env_stack <- raster::mask(env_stack, obs_hull_sp) %>% 
    raster::crop(extent(obs_hull_sp))
  writeRaster(env_stack, env_stack_grd, overwrite=T)  
}
env_stack <- stack(env_stack_grd)

# show map
# mapview(obs) + 
#   mapview(env_stack, hide = T) # makes html too big for Github
plot(env_stack, nc=2)

Pseudo-Absence

Based on convex hull, create a mask of region of interest and then generate random points inside mask.

absence_geo <- file.path(dir_data, "absence.geojson")
pts_geo     <- file.path(dir_data, "pts.geojson")
pts_env_csv <- file.path(dir_data, "pts_env.csv")

if (!file.exists(absence_geo) | redo){
  # get raster count of observations
  r_obs <- rasterize(
    sf::as_Spatial(obs), env_stack[[1]], field=1, fun='count')
  
  # show map
  # mapview(obs) + 
  #   mapview(r_obs)
  
  # create mask for 
  r_mask <- mask(env_stack[[1]] > -Inf, r_obs, inverse=T)
  
  # generate random points inside mask
  absence <- dismo::randomPoints(r_mask, nrow(obs)) %>% 
    as_tibble() %>% 
    st_as_sf(coords = c("x", "y"), crs = 4326)
  
  write_sf(absence, absence_geo, delete_dsn=T)
}
absence <- read_sf(absence_geo)

# show map of presence, ie obs, and absence
mapview(obs, col.regions = "green") + 
  mapview(absence, col.regions = "gray")

Create a data table that will feed into the SDM where:

  • y is the present column with values of 1 (present) or 0 (absent).
  • x is all other columns.
if (!file.exists(pts_env_csv) | redo){

  # combine presence and absence into single set of labeled points 
  pts <- rbind(
    obs %>% 
      mutate(
        present = 1) %>% 
      select(present, key),
    absence %>% 
      mutate(
        present = 0,
        key     = NA)) %>% 
    mutate(
      ID = 1:n()) %>% 
    relocate(ID)
  write_sf(pts, pts_geo, delete_dsn=T)

  # extract raster values for points
  pts_env <- raster::extract(env_stack, as_Spatial(pts), df=TRUE) %>% 
    tibble() %>% 
    # join present and geometry columns to raster value results for points
    left_join(
      pts %>% 
        select(ID, present),
      by = "ID") %>% 
    relocate(present, .after = ID) %>% 
    # extract lon, lat as single columns
    mutate(
      #present = factor(present),
      lon = st_coordinates(geometry)[,1],
      lat = st_coordinates(geometry)[,2]) %>% 
    select(-geometry)
  write_csv(pts_env, pts_env_csv)
}
pts_env <- read_csv(pts_env_csv)

pts_env %>% 
  # show first 10 presence, last 10 absence
  slice(c(1:10, (nrow(pts_env)-9):nrow(pts_env))) %>% 
  DT::datatable(
    rownames = F,
    options = list(
      dom = "t",
      pageLength = 20))

Term Plots

pts_env %>% 
  select(-ID) %>% 
  mutate(
    present = factor(present)) %>% 
  pivot_longer(-present) %>% 
  ggplot() +
  geom_density(aes(x = value, fill = present)) + 
  scale_fill_manual(values = alpha(c("gray", "green"), 0.5)) +
  scale_x_continuous(expand=c(0,0)) +
  scale_y_continuous(expand=c(0,0)) +
  theme_bw() + 
  facet_wrap(~name, scales = "free") +
  theme(
    legend.position = c(1, 0),
    legend.justification = c(1, 0))

Lab 1b: Regress

Pairs plots using GGally

The pairs plots shows correlations between variables. Based on these plots I would consider dropping ER_topoWet and ER_tri because they are highly correlated with WC_bio1.

GGally::ggpairs(
  select(pts_env, -ID),
  aes(color = factor(present)))

Linear Model

# setup model data
d <- pts_env %>% 
  select(-ID) %>%  # remove terms we don't want to model
  tidyr::drop_na() # drop the rows with NA values
nrow(d)
## [1] 19931
# fit a linear model
mdl <- lm(present ~ ., data = d)
summary(mdl)
## 
## Call:
## lm(formula = present ~ ., data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14695 -0.41599  0.02142  0.43100  1.53519 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.412e+00  8.532e-02  16.548  < 2e-16 ***
## WC_alt      -1.851e-04  9.285e-06 -19.931  < 2e-16 ***
## WC_bio1      2.394e-02  1.870e-03  12.806  < 2e-16 ***
## WC_bio12    -2.121e-04  9.299e-06 -22.804  < 2e-16 ***
## ER_tri      -2.098e-03  1.852e-04 -11.331  < 2e-16 ***
## ER_topoWet  -9.068e-02  3.881e-03 -23.364  < 2e-16 ***
## lon          7.571e-04  2.971e-04   2.548   0.0108 *  
## lat          6.830e-03  1.420e-03   4.809 1.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4623 on 19923 degrees of freedom
## Multiple R-squared:  0.1456, Adjusted R-squared:  0.1453 
## F-statistic: 485.1 on 7 and 19923 DF,  p-value: < 2.2e-16
y_predict <- predict(mdl, d, type="response")
y_true    <- d$present

range(y_predict)
## [1] -0.5351872  1.1469468
range(y_true)
## [1] 0 1

Generalized Linear Model (GLM)

# fit a generalized linear model with a binomial logit link function
mdl <- glm(present ~ ., family = binomial(link="logit"), data = d)
summary(mdl)
## 
## Call:
## glm(formula = present ~ ., family = binomial(link = "logit"), 
##     data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5407  -1.0206   0.2981   1.0386   3.2494  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.289e+00  4.236e-01  10.125  < 2e-16 ***
## WC_alt      -8.787e-04  4.576e-05 -19.202  < 2e-16 ***
## WC_bio1      1.207e-01  9.403e-03  12.835  < 2e-16 ***
## WC_bio12    -1.083e-03  4.904e-05 -22.091  < 2e-16 ***
## ER_tri      -1.089e-02  9.704e-04 -11.220  < 2e-16 ***
## ER_topoWet  -4.387e-01  1.917e-02 -22.890  < 2e-16 ***
## lon          5.145e-03  1.422e-03   3.617 0.000298 ***
## lat          3.995e-02  7.150e-03   5.587 2.31e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27630  on 19930  degrees of freedom
## Residual deviance: 24454  on 19923  degrees of freedom
## AIC: 24470
## 
## Number of Fisher Scoring iterations: 4
y_predict <- predict(mdl, d, type="response")

range(y_predict)
## [1] 0.005095279 0.960343222
# show term plots
termplot(mdl, partial.resid = TRUE, se = TRUE, main = F)

Generalized Additive Model (GAM)

librarian::shelf(mgcv)
# fit a generalized additive model with smooth predictors
mdl <- mgcv::gam(
  formula = present ~ s(WC_alt) + s(WC_bio1) + 
    s(WC_bio12) + s(ER_tri) + s(ER_topoWet) + s(lon) + s(lat), 
  family = binomial, data = d)
summary(mdl)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## present ~ s(WC_alt) + s(WC_bio1) + s(WC_bio12) + s(ER_tri) + 
##     s(ER_topoWet) + s(lon) + s(lat)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.13417    0.03173  -4.228 2.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                 edf Ref.df Chi.sq  p-value    
## s(WC_alt)     7.702  8.534 736.98  < 2e-16 ***
## s(WC_bio1)    7.201  7.808 601.92  < 2e-16 ***
## s(WC_bio12)   6.432  7.054 826.53  < 2e-16 ***
## s(ER_tri)     8.714  8.929  60.00  < 2e-16 ***
## s(ER_topoWet) 8.823  8.986  38.83 7.13e-06 ***
## s(lon)        8.525  8.931 572.01  < 2e-16 ***
## s(lat)        8.844  8.990 269.12  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.402   Deviance explained = 34.9%
## UBRE = -0.091154  Scale est. = 1         n = 19931

GAM Term Plots

# show term plots
plot(mdl, scale=0)

Question

Which GAM environmental variables, and even range of values, seem to contribute most towards presence (above 0 response) versus absence (below 0 response)?

Answer: The GAM shows that all environmental variables are statistically significant and contribute a lot to presence. Latitude and ER_topoWet appear to contribute the most to presence.

Maxtent (maximum entropy)

# load extra packages
librarian::shelf(
  maptools, sf)

# show version of maxent
maxent()
## This is MaxEnt version 3.4.3
# get environmental rasters
# NOTE: the first part of Lab 1. SDM - Explore got updated to write this clipped environmental raster stack
env_stack_grd <- file.path(dir_data, "env_stack.grd")
env_stack <- stack(env_stack_grd)
plot(env_stack, nc=2)

Maxtent model output

# get presence-only observation points (maxent extracts raster values for you)
obs_geo <- file.path(dir_data, "obs.geojson")
obs_sp <- read_sf(obs_geo) %>% 
  sf::as_Spatial() # maxent prefers sp::SpatialPoints over newer sf::sf class

# fit a maximum entropy model
mdl <- maxent(env_stack, obs_sp)
## This is MaxEnt version 3.4.3

Maxtent variable contribution plot

# plot variable contributions per predictor
plot(mdl)

Maxtent term plots

# plot term plots
response(mdl)

# predict
y_predict <- predict(env_stack, mdl) #, ext=ext, progress='')

plot(y_predict, main='Maxent, raw prediction')
data(wrld_simpl, package="maptools")
plot(wrld_simpl, add=TRUE, border='dark grey')

Question

Which Maxent environmental variables, and even range of values, seem to contribute most towards presence (closer to 1 response) and how might this differ from the GAM results?

Answer: WC_alt, WC_bio1, and WC_bio12 seem to contribute the most toward presence, while ER_tri and ER_topoWet seem to contribute the least. Important to note that WC_bio1 contributes far more than any other variables (greater than 60%). This differs significantly from the GAM results because….

References

  1. Chamberlain, Michael J., et al. “Spatial-Use Patterns, Movements, and Interactions among Adult Coyotes in Central Mississippi.” Canadian Journal of Zoology, vol. 78, no. 12, Dec. 2000, pp. 2087–95. DOI.org (Crossref), https://doi.org/10.1139/z00-154.

  2. Gese, Eric M., et al. “Home Range and Habitat Use of Coyotes in Southeastern Colorado.” The Journal of Wildlife Management, vol. 52, no. 4, Oct. 1988, p. 640. DOI.org (Crossref), https://doi.org/10.2307/3800923.

  3. Hinton, Joseph W., et al. “Space Use and Habitat Selection by Resident and Transient Coyotes (Canis Latrans).” PLOS ONE, edited by Marco Apollonio, vol. 10, no. 7, July 2015, p. e0132203. DOI.org (Crossref), https://doi.org/10.1371/journal.pone.0132203.